# Multimodal Pre-training
**Yoloe 11l Seg** · jameslahm · Object Detection · 219 downloads · 2 likes
YOLOE is a real-time visual omni-model that supports various vision tasks, including zero-shot object detection.

**Yoloe V8l Seg** · jameslahm · Object Detection · 4,135 downloads · 1 like
YOLOE is a real-time visual omni-model that combines object detection and visual understanding capabilities, suitable for various visual tasks.

**Yoloe V8s Seg** · jameslahm · Object Detection · 28 downloads · 0 likes
YOLOE is a zero-shot object detection model capable of detecting various objects in visual scenes in real time.
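
The YOLOE checkpoints above can be run through the `ultralytics` package (recent releases ship YOLOE support). Below is a minimal text-prompted detection sketch; the package version, the weight filename, and the prompting API are assumptions, not something stated in this listing:

```python
from ultralytics import YOLOE  # requires an ultralytics version that ships YOLOE

# Load a YOLOE detection/segmentation checkpoint (filename is an assumption;
# substitute the weight file you actually downloaded, e.g. yoloe-v8l-seg.pt).
model = YOLOE("yoloe-11l-seg.pt")

# Open-vocabulary detection: describe the target classes with free-form text.
names = ["person", "bus", "traffic light"]
model.set_classes(names, model.get_text_pe(names))

# Run inference on an image and save an annotated copy.
results = model.predict("street.jpg")
results[0].save("street_annotated.jpg")
```
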
**Aimv2 Huge Patch14 224.apple Pt** · timm · Image Classification · Transformers · 93 downloads · 0 likes
AIMv2 is an efficient image encoder implemented on top of the timm library, suitable for image feature extraction tasks.

**Aimv2 3b Patch14 224.apple Pt** · timm · Image Classification · Transformers · 50 downloads · 0 likes
AIMv2 is an efficient image encoder compatible with the timm library, suitable for computer vision tasks.
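
Both AIMv2 encoders, like the timm ViT checkpoints listed below, can be used as plain feature extractors through `timm`. A minimal sketch, assuming the weights are published on the Hub under `timm/aimv2_huge_patch14_224.apple_pt` (the exact identifier is an assumption):

```python
import timm
import torch
from PIL import Image

# Load the encoder from the Hub without a classification head; num_classes=0
# makes the model return pooled image features. The hf_hub path is assumed.
model = timm.create_model(
    "hf_hub:timm/aimv2_huge_patch14_224.apple_pt",
    pretrained=True,
    num_classes=0,
)
model.eval()

# Build the preprocessing pipeline that matches the checkpoint's training config.
data_cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_cfg, is_training=False)

# Extract a feature vector for a single image.
image = Image.open("example.jpg").convert("RGB")
with torch.no_grad():
    features = model(transform(image).unsqueeze(0))
print(features.shape)  # (1, embedding_dim)
```
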
**Vit So400m Patch14 Siglip 378.webli** · timm · Apache-2.0 · Image Classification · Transformers · 82 downloads · 0 likes
A vision Transformer model based on SigLIP, containing only an image encoder and utilizing the original attention pooling mechanism.

**Vit Large Patch16 Siglip Gap 384.webli** · timm · Apache-2.0 · Image Classification · Transformers · 13 downloads · 0 likes
A vision Transformer model based on SigLIP, utilizing global average pooling, suitable for image feature extraction tasks.

**Vit Base Patch16 Siglip 384.webli** · timm · Apache-2.0 · Image Classification · Transformers · 64 downloads · 1 like
A vision Transformer model based on SigLIP, containing only the image encoder and using the original attention pooling mechanism.

**Vit Base Patch16 Siglip 224.webli** · timm · Apache-2.0 · Image Classification · Transformers · 330 downloads · 1 like
A vision Transformer model based on SigLIP, containing only the image encoder and using the original attention pooling mechanism.

**Vit Large Patch14 Clip 224.laion2b** · timm · Apache-2.0 · Image Classification · Transformers · 502 downloads · 0 likes
A vision Transformer model based on the CLIP architecture, specialized in image feature extraction.
**Aimv2 Large Patch14 Native Image Classification** · amaye15 · MIT · Image Classification · Transformers · 15 downloads · 2 likes
AIMv2-Large-Patch14-Native is an adapted image classification model, modified from the original AIMv2 model to be compatible with Hugging Face Transformers' AutoModelForImageClassification class.
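
Since the checkpoint above was adapted specifically for Transformers' `AutoModelForImageClassification`, a classification call is straightforward. A minimal sketch; the Hub ID is an assumption, and the repository may require `trust_remote_code=True` if it ships custom model code:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Assumed Hub ID for the adapted checkpoint; adjust to the actual repository.
model_id = "amaye15/aimv2-large-patch14-native-image-classification"

processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id)
model.eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(-1).item()
print(model.config.id2label.get(predicted_class, predicted_class))
```
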
**Vit Base Patch32 Clip 224.metaclip 400m** · timm · Image Classification · 2,406 downloads · 0 likes
A vision-language model trained on the MetaCLIP-400M dataset, supporting zero-shot image classification tasks.

**Vit Base Patch32 Clip 224.laion2b E16** · timm · MIT · Image Classification · 7,683 downloads · 0 likes
A vision Transformer model trained on the LAION-2B dataset, supporting zero-shot image classification tasks.

**Openclip Resnet50 CC12M** · thaottn · MIT · Image Classification · 13.67k downloads · 0 likes
An OpenCLIP model based on the ResNet50 architecture and trained on the CC12M dataset, supporting zero-shot image classification tasks.
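
A zero-shot classification sketch with the `open_clip` library. The built-in `RN50` / `cc12m` pretrained tag is an assumption about how this checkpoint maps onto open_clip's registry; loading the exact Hub repository may instead require an `hf-hub:` path:

```python
import torch
import open_clip
from PIL import Image

# ResNet-50 CLIP weights pretrained on CC12M; the 'cc12m' tag is assumed to be
# available in the installed open_clip version.
model, _, preprocess = open_clip.create_model_and_transforms("RN50", pretrained="cc12m")
tokenizer = open_clip.get_tokenizer("RN50")
model.eval()

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```
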
**Wav2vec2 Base Audioset** · ALM · Audio Classification · Transformers · 2,191 downloads · 0 likes
An audio representation learning model based on the HuBERT architecture, pre-trained on the complete AudioSet dataset.

**Test2** · mccaly · Apache-2.0 · Image Segmentation · Transformers · 22 downloads · 1 like
FoodSeg103 is a dataset containing 7,118 food images annotated with 104 ingredient categories, with an average of 6 ingredient labels and pixel-level masks per image.

**Eva Giant Patch14 Clip 224.laion400m S11b B41k** · timm · MIT · Text-to-Image · 459 downloads · 1 like
A vision-language model based on the CLIP architecture, supporting zero-shot image classification tasks.

**Eva02 Large Patch14 Clip 336.merged2b S6b B61k** · timm · MIT · Text-to-Image · 15.78k downloads · 0 likes
EVA02 is a large-scale vision-language model based on the CLIP architecture, supporting zero-shot image classification tasks.
**Pix2struct Base** · google · Apache-2.0 · Image-to-Text · Transformers · Supports Multiple Languages · 6,390 downloads · 71 likes
Pix2Struct is an image encoder-text decoder model trained on various image-text pairs for tasks including image captioning and visual question answering.
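
A minimal inference sketch with Transformers' dedicated Pix2Struct classes, assuming the Hub ID `google/pix2struct-base` corresponds to this entry; note that the base checkpoint is a pretraining artifact and is normally fine-tuned on a downstream task before use:

```python
import torch
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

model_id = "google/pix2struct-base"  # assumed Hub ID

processor = Pix2StructProcessor.from_pretrained(model_id)
model = Pix2StructForConditionalGeneration.from_pretrained(model_id)
model.eval()

# Pix2Struct consumes a rendered image; fine-tuned variants also accept a
# text prompt (e.g. a question) via the processor's `text=` argument.
image = Image.open("document.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=50)

print(processor.decode(generated_ids[0], skip_special_tokens=True))
```
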
**Chinese Clip Vit Large Patch14 336px** · OFA-Sys · Text-to-Image · Transformers · 713 downloads · 23 likes
Chinese CLIP is a simple implementation of CLIP trained on approximately 200 million Chinese image-text pairs, using ViT-L/14@336px as the image encoder and RoBERTa-wwm-base as the text encoder.
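
Transformers ships dedicated Chinese-CLIP classes; a zero-shot image-text matching sketch, assuming the Hub ID `OFA-Sys/chinese-clip-vit-large-patch14-336px`:

```python
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model_id = "OFA-Sys/chinese-clip-vit-large-patch14-336px"  # assumed Hub ID

model = ChineseCLIPModel.from_pretrained(model_id)
processor = ChineseCLIPProcessor.from_pretrained(model_id)
model.eval()

image = Image.open("example.jpg").convert("RGB")
texts = ["一只猫", "一条狗", "一辆汽车"]  # candidate Chinese captions

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity, normalized over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```
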
**Taiyi Stable Diffusion 1B Chinese EN V0.1** · IDEA-CCNL · Openrail · Text-to-Image · Chinese · 182 downloads · 106 likes
The first open-source Chinese-English bilingual Stable Diffusion model, trained on 20 million filtered Chinese image-text pairs.

**Xclip Base Patch16 Ucf 2 Shot** · microsoft · MIT · Text-to-Video · Transformers · English · 51 downloads · 1 like
X-CLIP is a minimalist extension of CLIP for general video-language understanding. The model is trained on (video, text) pairs through contrastive learning.
**Layoutlmv3 Large Finetuned Funsd** · HYPJUDY · Text Recognition · Transformers · 66 downloads · 5 likes
LayoutLMv3-large fine-tuned on the FUNSD dataset, specializing in document understanding tasks.

**Layoutlmv3 Base Finetuned Funsd** · HYPJUDY · Text Recognition · Transformers · 329 downloads · 4 likes
A document AI model based on LayoutLMv3-base and fine-tuned on the FUNSD dataset, designed for form understanding tasks.
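
A token-classification (form understanding) sketch for the FUNSD fine-tunes above. Both Hub IDs, and the assumption that the checkpoints load through the Auto classes, are unverified here; the processor's built-in OCR additionally requires `pytesseract`:

```python
import torch
from PIL import Image
from transformers import AutoModelForTokenClassification, AutoProcessor

# Processor from the upstream base model (runs Tesseract OCR on the image);
# fine-tuned weights from the FUNSD checkpoint. Both IDs are assumptions.
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
model = AutoModelForTokenClassification.from_pretrained("HYPJUDY/layoutlmv3-base-finetuned-funsd")
model.eval()

image = Image.open("form.png").convert("RGB")
inputs = processor(image, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Map each token to its predicted FUNSD entity label (question/answer/header/other).
predictions = logits.argmax(-1).squeeze(0).tolist()
tokens = processor.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print([(t, model.config.id2label[p]) for t, p in zip(tokens, predictions)])
```
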
**Layoutlmv2 Large Uncased Finetuned Vi Infovqa** · tiennvcs · Text-to-Image · Transformers · 16 downloads · 0 likes
A document visual question answering model fine-tuned from microsoft/layoutlmv2-large-uncased, suitable for Vietnamese information extraction tasks.

**Bros Large Uncased** · naver-clova-ocr · Large Language Model · Transformers · 55 downloads · 6 likes
BROS is a pre-trained language model focusing on text and layout, designed to better extract key information from documents.
**Gpt2 Chinese Poem** · uer · Large Language Model · Chinese · 1,905 downloads · 38 likes
A Chinese classical poetry generation model based on the GPT2 architecture, pre-trained by UER-py, capable of generating Chinese classical poetry.
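
A generation sketch following the usage pattern of UER's Chinese GPT-2 checkpoints, which pair a GPT-2 language model head with a BERT-style tokenizer; the Hub ID `uer/gpt2-chinese-poem` and the prompt format are assumptions:

```python
from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline

model_id = "uer/gpt2-chinese-poem"  # assumed Hub ID

# UER's Chinese GPT-2 models use a BERT tokenizer, so load it explicitly
# instead of relying on AutoTokenizer.
tokenizer = BertTokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)

generator = TextGenerationPipeline(model, tokenizer)

# Prompt with the opening of a classical poem; [CLS] marks the sequence start.
result = generator("[CLS] 梅 山 如 积 翠 ，", max_length=50, do_sample=True)
print(result[0]["generated_text"])
```
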
**Markuplm Large Finetuned Qa** · FuriouslyAsleep · Multimodal Fusion · Transformers · 50 downloads · 1 like
A question-answering model fine-tuned from Microsoft's MarkupLM architecture, designed for Q&A tasks that combine web markup languages (HTML/XML) with text.